feat: wrap provider.chat in llm.call span with timing and tokens #262

Merged
l50 merged 2 commits into main from feat/telemetry-llm-call-span on May 8, 2026
l50 commented May 7, 2026

Key Changes:

  • Wrapped each provider.chat() call inside call_with_retry in its own llm.call info span so timing and token usage are attributed to the attempt that produced them
  • Captured per-attempt input, output, and cache token counts, duration, stop reason, and error message as span fields
  • Recorded task.id, llm.model, llm.attempt, llm.tool_count, and llm.message_count at span creation for filterable Tempo queries

Added:

  • Per-attempt llm.call info_span! in ares-llm/src/agent_loop/retry.rs with Empty placeholders for fields that are only known after the call returns (llm.input_tokens, llm.output_tokens, llm.cache_read_tokens, llm.cache_creation_tokens, llm.duration_ms, llm.stop_reason, llm.error)
  • Wall-clock duration measurement via std::time::Instant recorded into llm.duration_ms so retry waits are not folded into the successful call's latency
  • tracing::Instrument instrumentation of the provider.chat() future so async work runs inside the span context

Changed:

  • ares-llm/src/agent_loop/retry.rs use line now imports std::time::Instant plus tracing::{field::Empty, info_span, Instrument} alongside the existing warn
  • Result handling in call_with_retry was split: the call result is first inspected to record token usage / stop reason / error on the span, then the existing retry decision logic runs on that same result

…tokens

Without per-call attribution there was no way to tell whether a slow
agent loop was burning time inside provider.chat (network/LLM) or
between calls (tool dispatch, queue waits). Token spend was visible
only on the session log, not in Tempo traces.

Open one `llm.call` span per retry attempt around `provider.chat`.
After the call returns, record duration_ms plus the four token-usage
counters and stop_reason as span attributes; on error, record the
formatted error. Each retry gets its own span so a 429 backoff does
not inflate the duration attributed to the eventual successful call.
The span also carries `task.id`, `llm.model`, `llm.attempt`, and the
request shape (tool/message counts) so Tempo queries can isolate
slow calls without joining other spans.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
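
The point about a 429 backoff not inflating the successful call's duration can be shown with a standard-library-only sketch. Everything here is hypothetical: `AttemptTiming` stands in for the fields of one `llm.call` span, `call_with_timing` is a simplified synchronous stand-in for `call_with_retry`, and numbering attempts from 1 is an assumption. The timer starts and stops around each attempt, and the backoff sleep happens outside that window.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical per-attempt record, standing in for one `llm.call` span.
#[derive(Debug)]
pub struct AttemptTiming {
    pub attempt: u32,
    pub duration_ms: u128,
    pub error: Option<String>,
}

/// Calls `call` up to `max_attempts` times and times each attempt separately.
/// Returns the first successful value (if any) plus one record per attempt.
pub fn call_with_timing<T, F>(
    max_attempts: u32,
    backoff: Duration,
    mut call: F,
) -> (Option<T>, Vec<AttemptTiming>)
where
    F: FnMut(u32) -> Result<T, String>,
{
    let mut timings = Vec::new();
    for attempt in 1..=max_attempts {
        // The clock covers only the call itself.
        let started = Instant::now();
        let result = call(attempt);
        let duration_ms = started.elapsed().as_millis();

        match result {
            Ok(value) => {
                timings.push(AttemptTiming { attempt, duration_ms, error: None });
                return (Some(value), timings);
            }
            Err(err) => {
                timings.push(AttemptTiming { attempt, duration_ms, error: Some(err) });
                // The backoff happens after the attempt's duration was taken,
                // so it is not attributed to this attempt or the next one.
                if attempt < max_attempts {
                    sleep(backoff);
                }
            }
        }
    }
    (None, timings)
}
```

With a closure that fails on the first attempt and succeeds on the second, the result holds two records: the first with the error message, the second with a duration that reflects only the second call, regardless of how long the backoff was.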
l50 changed the title from "chore: update GitHub Actions dependencies and improve LLM call tracing" to "feat: wrap provider.chat in llm.call span with timing and tokens" on May 7, 2026

codecov Bot commented May 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 75.11%. Comparing base (60b2915) to head (5b67e1b).
⚠️ Report is 9 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #262   +/-   ##
=======================================
  Coverage   75.10%   75.11%           
=======================================
  Files         383      383           
  Lines       81465    81492   +27     
=======================================
+ Hits        61187    61214   +27     
  Misses      20278    20278           

l50 merged commit 205ae6f into main on May 8, 2026 (11 checks passed)
l50 deleted the feat/telemetry-llm-call-span branch on May 8, 2026 at 01:06
l50 added a commit that referenced this pull request May 9, 2026
l50 added a commit that referenced this pull request May 9, 2026